Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding

Authors

  • Fionn Murtagh
  • Geoff Downs
  • Pedro Contreras
Abstract

Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding, through use of less than full precision in data values, we can appreciably improve the effectiveness and efficiency of hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
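To make the idea of a longest common prefix ultrametric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: values are assumed to lie in [0, 1), digits are taken in base 10 and truncated to a fixed precision, and the distance is taken as `base ** (-k)` for a shared prefix of length `k`. The helper names (`digit_string`, `baire_distance`, `prefix_hierarchy`) and the default `base=2.0` are hypothetical choices made for this example.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_DOWN


def digit_string(x, precision):
    """First `precision` decimal digits of x in [0, 1), truncated (not rounded)."""
    step = Decimal(1).scaleb(-precision)  # 10 ** -precision
    d = Decimal(str(x)).quantize(step, rounding=ROUND_DOWN)
    return format(d, "f").split(".")[1]


def common_prefix_length(a, b):
    """Number of leading characters shared by strings a and b."""
    k = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        k += 1
    return k


def baire_distance(x, y, precision=4, base=2.0):
    """Longest-common-prefix distance: base ** (-k), k = shared prefix length.

    Assumption of this sketch: values that agree on all `precision` digits
    get the smallest representable distance, base ** (-precision), not 0.
    """
    k = common_prefix_length(digit_string(x, precision),
                             digit_string(y, precision))
    return base ** (-k)


def prefix_hierarchy(values, precision=4):
    """One partition per prefix length: level k-1 groups values by their first k digits."""
    levels = []
    for k in range(1, precision + 1):
        buckets = defaultdict(list)
        for v in values:
            buckets[digit_string(v, precision)[:k]].append(v)
        levels.append(dict(buckets))
    return levels


values = [0.1234, 0.1239, 0.1291, 0.5678]
levels = prefix_hierarchy(values)
```

Each level of `levels` refines the one before it, so the nested buckets form a hierarchy directly from the digit prefixes, without computing pairwise distances between observations. Reducing `precision` coarsens the coding and merges buckets, which is the sense in which lower precision reduces the amount of data to be clustered.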

Related resources

Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. “Structure” can be understood as symmetry, and a range of symmetries are expressed by hierarchy. Such symmetries directly point to invariants that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy ...

Ultrametric Embedding: Application to Data Fingerprinting and to Fast Data Clustering

Data is naturally ultrametric when high dimensional and/or sparse. Local ultrametricity of data can be increased by appropriate data coding. New perspectives on data analysis, and on data analysis algorithms, are opened up by these findings. We discuss data recoding, in order to increase local ultrametricity of data, using time series signals, and texts, of various origins. We then look at how ...

The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity ...

Temporal Hierarchical Clustering

We study hierarchical clusterings of metric spaces that change over time. This is a natural geometric primitive for the analysis of dynamic data sets. Specifically, we introduce and study the problem of finding a temporally coherent sequence of hierarchical clusterings from a sequence of unlabeled point sets. We encode the clustering objective by embedding each point set into an ultrametric spa...

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

Journal:
  • SIAM J. Scientific Computing

Volume 30

Publication date: 2008